Abstract

By developing data augmentation methods unique to the negative binomial (NB) 
distribution, we unite seemingly disjoint count and mixture models under the NB 
process framework. We develop fundamental properties of the models and derive 
efficient Gibbs sampling inference. We show that the gamma-NB process can 
be reduced to the hierarchical Dirichlet process with normalization, highlighting 
its unique theoretical, structural and computational advantages. A variety of NB 
processes with distinct sharing mechanisms are constructed and applied to topic 
modeling, with connections to existing algorithms, showing the importance of 
inferring both the NB dispersion and probability parameters. 



1 Introduction 

There has been increasing interest in count modeling using the Poisson process, geometric process
[1, 2, 3, 4] and recently the negative binomial (NB) process [5, 6]. Notably, it has been
independently shown in [5] and [6] that the NB process, originally constructed for count analysis,
can be naturally applied to mixture modeling of grouped data $x_1, \ldots, x_J$, where each group
$x_j = \{x_{ji}\}_{i=1,N_j}$. For a territory long occupied by the hierarchical Dirichlet process
(HDP) [7] and related models, the inference of which may require substantial bookkeeping and suffer
from slow convergence [7], the discovery of the NB process for mixture modeling can be significant.
As the seemingly disjoint count and mixture modeling problems are united under the NB process
framework, new opportunities emerge for better data fitting, more efficient inference and more
flexible model constructions. However, neither [5] nor [6] explores the properties of the NB
distribution deeply enough to achieve fully tractable closed-form inference. Of particular concern
is the NB dispersion parameter, which was simply fixed or empirically set [6], or inferred with a
Metropolis-Hastings algorithm [5]. Under these limitations, both papers fail to reveal the
connections of the NB process to the HDP, and thus may lead to false assessments when comparing
their modeling abilities.

We perform joint count and mixture modeling under the NB process framework, using completely
random measures [1, 8, 9] that are simple to construct and amenable to posterior computation.
We propose to augment-and-conquer the NB process: by "augmenting" a NB process into both
the gamma-Poisson and compound Poisson representations, we "conquer" the unification of count
and mixture modeling, the analysis of fundamental model properties, and the derivation of efficient
Gibbs sampling inference. We make two additional contributions: 1) We construct a gamma-NB
process, analyze its properties and show how its normalization leads to the HDP, highlighting its
unique theoretical, structural and computational advantages relative to the HDP. 2) We show that
a variety of NB processes can be constructed with distinct model properties, for which the shared
random measure can be selected from completely random measures such as the gamma, beta, and
beta-Bernoulli processes; we compare their performance on topic modeling, a typical example of
mixture modeling of grouped data, and show the importance of inferring both the NB dispersion and
probability parameters, which respectively govern the overdispersion level and the variance-to-mean
ratio in count modeling.

1.1 Poisson process for count and mixture modeling

Before introducing the NB process, we first illustrate how the seemingly distinct problems of count
and mixture modeling can be united under the Poisson process. Denote $\Omega$ as a measure space and
for each Borel set $A \subset \Omega$, denote $X_j(A)$ as a count random variable describing the
number of observations in $x_j$ that reside within $A$. Given grouped data $x_1, \ldots, x_J$, for
any measurable disjoint partition $A_1, \ldots, A_Q$ of $\Omega$, we aim to jointly model the count
random variables $\{X_j(A_q)\}$. A natural choice would be to define a Poisson process
$X_j \sim \mathrm{PP}(G)$, with a shared completely random measure $G$ on $\Omega$, such that
$X_j(A) \sim \mathrm{Pois}(G(A))$ for each $A \subset \Omega$. Denote
$G(\Omega) = \sum_{q=1}^{Q} G(A_q)$ and $\tilde{G} = G/G(\Omega)$. Following Lemma 4.1 of [5], the
joint distributions of $X_j(\Omega), X_j(A_1), \ldots, X_j(A_Q)$ are equivalent under the following
two expressions:

$$X_j(\Omega) = \sum_{q=1}^{Q} X_j(A_q), \quad X_j(A_q) \sim \mathrm{Pois}(G(A_q)); \tag{1}$$

$$X_j(\Omega) \sim \mathrm{Pois}(G(\Omega)), \quad [X_j(A_1), \ldots, X_j(A_Q)] \sim \mathrm{Mult}(X_j(\Omega); \tilde{G}(A_1), \ldots, \tilde{G}(A_Q)). \tag{2}$$

Thus the Poisson process provides not only a way to generate independent counts from each $A_q$,
but also a mechanism for mixture modeling, which allocates the observations into any measurable
disjoint partition $\{A_q\}_{1,Q}$ of $\Omega$, conditioning on $X_j(\Omega)$ and the normalized
mean measure $\tilde{G}$.
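The equivalence between (1) and (2) can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the three rates standing in for $G(A_1), G(A_2), G(A_3)$ are hypothetical values chosen only for illustration.

```python
import numpy as np

def sample_counts_independent(rates, n_draws, rng):
    """Expression (1): independent Poisson counts, one per partition cell."""
    return rng.poisson(rates, size=(n_draws, len(rates)))

def sample_counts_multinomial(rates, n_draws, rng):
    """Expression (2): Poisson total, then multinomial allocation of that total."""
    rates = np.asarray(rates, dtype=float)
    total_rate = rates.sum()
    probs = rates / total_rate                      # normalized mean measure
    totals = rng.poisson(total_rate, size=n_draws)  # X_j(Omega)
    return np.array([rng.multinomial(n, probs) for n in totals])

rng = np.random.default_rng(0)
rates = np.array([0.5, 2.0, 3.5])  # hypothetical G(A_1), G(A_2), G(A_3)
a = sample_counts_independent(rates, 20_000, rng)
b = sample_counts_multinomial(rates, 20_000, rng)

# Per-cell means and variances should agree (both close to the rates).
mean_gap = np.abs(a.mean(axis=0) - b.mean(axis=0)).max()
var_gap = np.abs(a.var(axis=0) - b.var(axis=0)).max()
```

Both samplers should give per-cell means and variances near the rates, up to Monte Carlo error, which is the equal-dispersion behavior of the Poisson process.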

To complete the model, we may place a gamma process [9] prior on the shared measure as
$G \sim \mathrm{GaP}(c, G_0)$, with concentration parameter $c$ and base measure $G_0$, such that
$G(A) \sim \mathrm{Gamma}(G_0(A), 1/c)$ for each $A \subset \Omega$, where $G_0$ can be continuous,
discrete or a combination of both. Note that $\tilde{G} = G/G(\Omega)$ now becomes a Dirichlet
process (DP) as $\tilde{G} \sim \mathrm{DP}(\gamma_0, \tilde{G}_0)$, where
$\gamma_0 = G_0(\Omega)$ and $\tilde{G}_0 = G_0/\gamma_0$. The normalized gamma representation of
the DP is discussed in [10, 11, 9] and has been used to construct the group-level DPs for an
HDP [12]. The Poisson process has an equal-dispersion assumption for count modeling. As shown in
(2), the construction of Poisson processes with a shared gamma process mean measure implies the
same mixture proportions across groups, which is essentially the same as the DP when used for
mixture modeling when the total counts $\{X_j(\Omega)\}_j$ are not treated as random variables.
This motivates us to consider adding an additional layer or using a distribution other than the
Poisson to model the counts. As shown below, the NB distribution is an ideal candidate, not only
because it allows overdispersion, but also because it can be augmented into both a gamma-Poisson
and a compound Poisson representation.
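Restricted to a finite partition, the statement that the normalized gamma process is a Dirichlet process says that independent gamma masses, once normalized, follow a Dirichlet distribution. The sketch below illustrates this, assuming NumPy; the base masses and the value of $c$ are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
c = 2.0
base_masses = np.array([0.3, 1.2, 0.5])  # hypothetical G_0(A_q) on a 3-cell partition
gamma0 = base_masses.sum()               # gamma_0 = G_0(Omega)

# G(A_q) ~ Gamma(shape=G_0(A_q), scale=1/c), independently over cells.
G = rng.gamma(shape=base_masses, scale=1.0 / c, size=(100_000, 3))
total_mass = G.sum(axis=1)
G_tilde = G / total_mass[:, None]        # normalized masses

# Moments of Dirichlet(G_0(A_1), ..., G_0(A_Q)) for comparison.
dirichlet_mean = base_masses / gamma0
dirichlet_var = dirichlet_mean * (1.0 - dirichlet_mean) / (gamma0 + 1.0)

mean_gap = np.abs(G_tilde.mean(axis=0) - dirichlet_mean).max()
var_gap = np.abs(G_tilde.var(axis=0) - dirichlet_var).max()
# The total mass is independent of the normalized proportions.
corr = np.corrcoef(total_mass, G_tilde[:, 0])[0, 1]
```

The concentration parameter $c$ only rescales the total mass and does not affect the normalized proportions, which is why it disappears from the DP.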

2 Augment-and-Conquer the Negative Binomial Distribution 

The NB distribution $m \sim \mathrm{NB}(r, p)$ has the probability mass function (PMF)
$f_M(m) = \frac{\Gamma(r+m)}{m!\,\Gamma(r)} (1-p)^r p^m$. It has a mean $\mu = rp/(1-p)$ smaller
than the variance $\sigma^2 = rp/(1-p)^2 = \mu + r^{-1}\mu^2$, with the variance-to-mean ratio
(VMR) as $(1-p)^{-1}$ and the overdispersion level (ODL, the coefficient of the quadratic term in
$\sigma^2$) as $r^{-1}$. It has been widely investigated and applied to numerous scientific
studies [13, 14, 15, 16]. The NB distribution can be augmented into a gamma-Poisson construction
[17] as $m \sim \mathrm{Pois}(\lambda)$, $\lambda \sim \mathrm{Gamma}(r, p/(1-p))$, where the gamma
distribution is parameterized by its shape $r$ and scale $p/(1-p)$. It can also be augmented under
a compound Poisson representation [18] as $m = \sum_{t=1}^{l} u_t$, $u_t \sim \mathrm{Log}(p)$,
$l \sim \mathrm{Pois}(-r \ln(1-p))$, where $u \sim \mathrm{Log}(p)$ is the logarithmic distribution
[19, 20] with probability-generating function (PGF) $C_U(z) = \ln(1-pz)/\ln(1-p)$, $|z| < p^{-1}$.
In a slight abuse of notation, but for added conciseness, in the following discussion we use
$m \sim \sum_{t=1}^{l} \mathrm{Log}(p)$ to denote $m = \sum_{t=1}^{l} u_t$,
$u_t \sim \mathrm{Log}(p)$.
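The two augmentations can be compared by simulation. The sketch below assumes NumPy, whose `Generator.logseries` draws from the logarithmic distribution; the values of `r` and `p` are hypothetical. Both samplers should reproduce the NB mean $rp/(1-p)$ and variance $rp/(1-p)^2$.

```python
import numpy as np

def sample_nb_gamma_poisson(r, p, size, rng):
    """m ~ Pois(lambda), lambda ~ Gamma(shape=r, scale=p / (1 - p))."""
    lam = rng.gamma(shape=r, scale=p / (1.0 - p), size=size)
    return rng.poisson(lam)

def sample_nb_compound_poisson(r, p, size, rng):
    """m = sum of l iid Log(p) draws, with l ~ Pois(-r ln(1 - p))."""
    l = rng.poisson(-r * np.log1p(-p), size=size)
    return np.array([rng.logseries(p, size=k).sum() if k > 0 else 0 for k in l])

rng = np.random.default_rng(2)
r, p = 2.5, 0.4                      # hypothetical dispersion and probability
nb_mean = r * p / (1.0 - p)
nb_var = r * p / (1.0 - p) ** 2
vmr = 1.0 / (1.0 - p)                # variance-to-mean ratio
odl = 1.0 / r                        # overdispersion level

m_gp = sample_nb_gamma_poisson(r, p, 30_000, rng)
m_cp = sample_nb_compound_poisson(r, p, 30_000, rng)
```

Here `nb_var` equals `nb_mean + odl * nb_mean**2` and also `vmr * nb_mean`, matching the two ways of describing overdispersion in the text.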

The inference of the NB dispersion parameter $r$ has long been a challenge [13, 21, 22, 16]. In
this paper, we first place a gamma prior on it as $r \sim \mathrm{Gamma}(r_1, 1/c_1)$. We then use
Lemma 2.1 (below) to infer a latent count $l$ for each $m \sim \mathrm{NB}(r, p)$ conditioning on
$m$ and $r$. Since $l \sim \mathrm{Pois}(-r \ln(1-p))$ by construction, we can use the gamma
Poisson conjugacy to update $r$. Using Lemma 2.2 (below), we can further infer an augmented latent
count $l'$ for each $l$, and then use these latent counts to update $r_1$, assuming
$r_1 \sim \mathrm{Gamma}(r_2, 1/c_2)$. Using Lemmas 2.1 and 2.2 we can continue this process
repeatedly, suggesting that we may build a NB process to model data that have subgroups within
groups. The conditional posterior of the latent count $l$ was first derived by us but was not
given an analytical form [23]. Below we explicitly derive the PMF of $l$, shown in (3), and find
that it exactly represents the distribution of the random number of tables occupied by $m$
customers in a Chinese restaurant process with concentration parameter $r$ [10, 24, 7]. We denote
$l \sim \mathrm{CRT}(m, r)$ as a Chinese restaurant table (CRT) count random variable with such a
PMF and, as proved in the supplementary material, we can sample it as $l = \sum_{n=1}^{m} b_n$,
$b_n \sim \mathrm{Bernoulli}(r/(n-1+r))$.

Both the gamma-Poisson and compound-Poisson augmentations of the NB distribution and Lemmas
2.1 and 2.2 are key ingredients of this paper. We will show that these augment-and-conquer methods
not only unite count and mixture modeling and provide efficient inference, but also, as shown in
Section 3, let us examine the posteriors to understand fundamental properties of the NB processes,
clearly revealing connections to previous nonparametric Bayesian mixture models.

Lemma 2.1. Denote $s(m, j)$ as Stirling numbers of the first kind [19]. Augment
$m \sim \mathrm{NB}(r, p)$ under the compound Poisson representation as
$m \sim \sum_{t=1}^{l} \mathrm{Log}(p)$, $l \sim \mathrm{Pois}(-r \ln(1-p))$, then
the conditional posterior of $l$ has PMF

$$\Pr(l = j \mid m, r) = \frac{\Gamma(r)}{\Gamma(m+r)} |s(m, j)| r^j, \quad j = 0, 1, \ldots, m. \tag{3}$$

Proof. Denote $w_j \sim \sum_{t=1}^{j} \mathrm{Log}(p)$, $j = 1, \ldots, m$. Since $w_j$ is the
summation of $j$ iid $\mathrm{Log}(p)$ random variables, the PGF of $w_j$ becomes
$C_{W_j}(z) = C_U^j(z) = [\ln(1-pz)/\ln(1-p)]^j$, $|z| < p^{-1}$. Using the property that
$[\ln(1+x)]^j = j! \sum_{n=j}^{\infty} \frac{s(n, j) x^n}{n!}$ [19], we have
$\Pr(w_j = m) = C_{W_j}^{(m)}(0)/m! = (-1)^m p^m j!\, s(m, j)/(m! [\ln(1-p)]^j)$. Thus for
$0 \le j \le m$, we have
$\Pr(l = j \mid m, r) \propto \Pr(w_j = m)\,\mathrm{Pois}(j; -r \ln(1-p)) \propto |s(m, j)| r^j$.
Denote $S_r(m) = \sum_{j=0}^{m} |s(m, j)| r^j$, we have
$S_r(m) = (m-1+r) S_r(m-1) = \cdots = \prod_{n=1}^{m-1} (r+n) S_r(1) = \prod_{n=0}^{m-1} (r+n) = \frac{\Gamma(m+r)}{\Gamma(r)}$. □

Lemma 2.2. Let $m \sim \mathrm{NB}(r, p)$, $r \sim \mathrm{Gamma}(r_1, 1/c_1)$, denote
$p' = \frac{-\ln(1-p)}{c_1 - \ln(1-p)}$, then $m$ can also be generated from a compound
distribution as

$$m \sim \sum_{t=1}^{l} \mathrm{Log}(p), \quad l \sim \sum_{t'=1}^{l'} \mathrm{Log}(p'), \quad l' \sim \mathrm{Pois}(-r_1 \ln(1-p')). \tag{4}$$

Proof. Augmenting $m$ leads to $m \sim \sum_{t=1}^{l} \mathrm{Log}(p)$,
$l \sim \mathrm{Pois}(-r \ln(1-p))$. Marginalizing out $r$ leads to $l \sim \mathrm{NB}(r_1, p')$.
Augmenting $l$ using its compound Poisson representation leads to (4). □
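To make the CRT random variable concrete, the sketch below draws $l \sim \mathrm{CRT}(m, r)$ as a sum of Bernoulli variables and compares the empirical frequencies with the PMF in (3), computed from unsigned Stirling numbers of the first kind via their standard recurrence. It assumes NumPy; the function names and the values of `m` and `r` are illustrative only, and the exact PMF routine is meant for small `m` (the Stirling numbers grow quickly).

```python
import numpy as np
from math import lgamma, exp

def sample_crt(m, r, rng):
    """Draw l ~ CRT(m, r) as a sum of Bernoulli(r / (n - 1 + r)), n = 1..m."""
    if m == 0:
        return 0
    n = np.arange(1, m + 1)
    return int((rng.random(m) < r / (n - 1 + r)).sum())

def unsigned_stirling_first(m):
    """Return [|s(m, 0)|, ..., |s(m, m)|] as exact Python integers."""
    row = [1]  # |s(0, 0)| = 1
    for n in range(1, m + 1):
        new = [0] * (n + 1)
        for j in range(n + 1):
            # |s(n, j)| = (n - 1) |s(n - 1, j)| + |s(n - 1, j - 1)|
            if j <= n - 1:
                new[j] += (n - 1) * row[j]
            if j >= 1:
                new[j] += row[j - 1]
        row = new
    return row

def crt_pmf(m, r):
    """PMF (3): Gamma(r) / Gamma(m + r) * |s(m, j)| * r**j for j = 0..m."""
    norm = exp(lgamma(r) - lgamma(m + r))
    return [norm * s * r ** j for j, s in enumerate(unsigned_stirling_first(m))]

rng = np.random.default_rng(0)
m, r = 8, 1.7  # hypothetical count and dispersion
pmf = np.array(crt_pmf(m, r))
draws = np.array([sample_crt(m, r, rng) for _ in range(40_000)])
empirical = np.bincount(draws, minlength=m + 1) / draws.size
max_gap = np.abs(empirical - pmf).max()
```

The PMF sums to one because $\sum_j |s(m, j)| r^j = \Gamma(m+r)/\Gamma(r)$, which is the identity used at the end of the proof of Lemma 2.1.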

3 Gamma-Negative Binomial Process 

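We explore sharing the NB dispersion across groups while the probability parameters are group
dependent. We define a NB process $X \sim \mathrm{NBP}(G, p)$ as $X(A) \sim \mathrm{NB}(G(A), p)$
for each $A \subset \Omega$ and construct a gamma-NB process for joint count and mixture modeling
as $X_j \sim \mathrm{NBP}(G, p_j)$, $G \sim \mathrm{GaP}(c, G_0)$, which can be augmented as a
gamma-gamma-Poisson process as

$$X_j \sim \mathrm{PP}(\Lambda_j), \quad \Lambda_j \sim \mathrm{GaP}((1-p_j)/p_j, G), \quad G \sim \mathrm{GaP}(c, G_0). \tag{5}$$

In the above, $\mathrm{PP}(\cdot)$ and $\mathrm{GaP}(\cdot)$ represent the Poisson and gamma
processes, respectively, as defined in Section 1.1. Using Lemma 2.2, the gamma-NB process can also
be augmented as

$$X_j \sim \sum_{t=1}^{L_j} \mathrm{Log}(p_j), \quad L_j \sim \mathrm{PP}(-G \ln(1-p_j)), \quad G \sim \mathrm{GaP}(c, G_0); \tag{6}$$

$$L = \sum_j L_j \sim \sum_{t=1}^{L'} \mathrm{Log}(p'), \quad L' \sim \mathrm{PP}(-G_0 \ln(1-p')), \quad p' = \frac{-\sum_j \ln(1-p_j)}{c - \sum_j \ln(1-p_j)}. \tag{7}$$

These three augmentations allow us to derive a sequence of closed-form update equations for
inference with the gamma-NB process. Using the gamma Poisson conjugacy on (5), for each
$A \subset \Omega$, we have
$\Lambda_j(A) \mid G, X_j, p_j \sim \mathrm{Gamma}(G(A) + X_j(A), p_j)$, thus the conditional
posterior of $\Lambda_j$ is

$$\Lambda_j \mid G, X_j, p_j \sim \mathrm{GaP}(1/p_j, G + X_j). \tag{8}$$

Define $T \sim \mathrm{CRTP}(X, G)$ as a CRT process such that
$T(A) = \sum_{\omega \in A} T(\omega)$, $T(\omega) \sim \mathrm{CRT}(X(\omega), G(\omega))$ for
each $A \subset \Omega$. Applying Lemma 2.1 on (6) and (7), we have

$$L_j \mid X_j, G \sim \mathrm{CRTP}(X_j, G), \quad L' \mid L, G_0 \sim \mathrm{CRTP}(L, G_0). \tag{9}$$

If $G_0$ is a continuous base measure and $\gamma_0 = G_0(\Omega)$ is finite, we have
$G_0(\omega) \to 0 \; \forall\, \omega \in \Omega$ and thus

$$L'(\Omega) \mid L, G_0 = \sum_{\omega \in \Omega} \delta(L(\omega) > 0) = \sum_{\omega \in \Omega} \delta\Big(\sum_j X_j(\omega) > 0\Big) \tag{10}$$

which is equal to $K^+$, the total number of used discrete atoms; if $G_0$ is discrete as
$G_0 = \sum_{k=1}^{K} \frac{\gamma_0}{K} \delta_{\omega_k}$, then
$L'(\omega_k) = \mathrm{CRT}(L(\omega_k), \gamma_0/K) \ge 1$ if $\sum_j X_j(\omega_k) > 0$, thus
$L'(\Omega) \ge K^+$. In either case, let $\gamma_0 \sim \mathrm{Gamma}(e_0, 1/f_0)$; with the
gamma Poisson conjugacy on (6) and (7), we have

$$\gamma_0 \mid \{L'(\Omega), p'\} \sim \mathrm{Gamma}\Big(e_0 + L'(\Omega), \frac{1}{f_0 - \ln(1-p')}\Big); \tag{11}$$

$$G \mid G_0, \{L_j, p_j\} \sim \mathrm{GaP}\Big(c - \sum_j \ln(1-p_j), \; G_0 + \sum_j L_j\Big). \tag{12}$$

Since the data $\{x_{ji}\}_i$ are exchangeable within group $j$, the predictive distribution of a
point $X_{ji}$, conditioning on $X_j^{-i} = \{X_{jn}\}_{n: n \ne i}$ and $G$, with $\Lambda_j$
marginalized out, can be expressed as

$$X_{ji} \mid G, X_j^{-i} \sim \frac{E[\Lambda_j \mid G, X_j^{-i}]}{E[\Lambda_j(\Omega) \mid G, X_j^{-i}]} = \frac{G}{G(\Omega) + X_j(\Omega) - 1} + \frac{X_j^{-i}}{G(\Omega) + X_j(\Omega) - 1}. \tag{13}$$

3.1 Relationship with the hierarchical Dirichlet process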

Using the equivalence between (1) and (2) and normalizing all the gamma processes in (5), denoting
$\tilde{\Lambda}_j = \Lambda_j/\Lambda_j(\Omega)$, $\alpha = G(\Omega)$, $\tilde{G} = G/\alpha$,
$\gamma_0 = G_0(\Omega)$ and $\tilde{G}_0 = G_0/\gamma_0$, we can re-express (5) as

$$X_{ji} \sim \tilde{\Lambda}_j, \quad \tilde{\Lambda}_j \sim \mathrm{DP}(\alpha, \tilde{G}), \quad \alpha \sim \mathrm{Gamma}(\gamma_0, 1/c), \quad \tilde{G} \sim \mathrm{DP}(\gamma_0, \tilde{G}_0) \tag{14}$$

which is an HDP [7]. Thus the normalized gamma-NB process leads to an HDP, yet we cannot
return from the HDP to the gamma-NB process without modeling $X_j(\Omega)$ and
$\Lambda_j(\Omega)$ as random variables. Theoretically, they are distinct in that the gamma-NB
process is a completely random measure, assigning independent random variables to any disjoint
Borel sets $\{A_q\}_{1,Q}$ of $\Omega$, whereas the HDP is not. Practically, the gamma-NB process
can exploit conjugacy to achieve analytical conditional posteriors for all latent parameters. The
inference of the HDP is a major challenge and it is usually solved through alternative
constructions such as the Chinese restaurant franchise (CRF) and stick-breaking representations
[7, 25]. In particular, without analytical conditional posteriors, the inference of the
concentration parameters $\alpha$ and $\gamma_0$ is nontrivial [7, 26] and they are often simply
fixed [25]. Under the CRF metaphor, $\alpha$ governs the random number of tables occupied by
customers in each restaurant independently; further, if the base probability measure
$\tilde{G}_0$ is continuous, $\gamma_0$ governs the random number of dishes selected by tables
of all restaurants. One may apply the data augmentation method of [24] to sample $\alpha$ and
$\gamma_0$. However, if $G_0$ is discrete as
$G_0 = \sum_{k=1}^{K} \frac{\gamma_0}{K} \delta_{\omega_k}$, which is of practical value and
becomes a continuous base measure as $K \to \infty$ [11, 7, 26], then using the method of [24] to
sample $\gamma_0$ is only approximately correct, which may result in a biased estimate in
practice, especially if $K$ is not large enough. By contrast, in the gamma-NB process, the shared
gamma process $G$ can be analytically updated with (12), and $G(\Omega)$ plays the role of
$\alpha$ in the HDP, which is readily available as

$$G(\Omega) \mid G_0, \{L_j, p_j\}_{j=1,N} \sim \mathrm{Gamma}\Big(\gamma_0 + \sum_j L_j(\Omega), \frac{1}{c - \sum_j \ln(1-p_j)}\Big) \tag{15}$$

and as in (11), regardless of whether the base measure is continuous, the total mass $\gamma_0$
has an analytical gamma posterior whose shape parameter is governed by $L'(\Omega)$, with
$L'(\Omega) = K^+$ if $G_0$ is continuous and finite and $L'(\Omega) \ge K^+$ if
$G_0 = \sum_{k=1}^{K} \frac{\gamma_0}{K} \delta_{\omega_k}$. Equation (15) also intuitively shows
how the NB probability parameters $\{p_j\}$ govern the variations among $\{\tilde{\Lambda}_j\}$
in the gamma-NB process. In the HDP, $p_j$ is not explicitly modeled, and since its value becomes
irrelevant when taking the normalized constructions in (14), it is usually treated as a nuisance
parameter and perceived as $p_j = 0.5$ when needed for interpretation purposes. Fixing
$p_j = 0.5$ is also considered in [12] to construct an HDP, whose group-level DPs are normalized
from gamma processes with the scale parameters as $p_j/(1-p_j) = 1$; it is also shown in [12] that
improved performance can be obtained for topic modeling by learning the scale parameters with a
log Gaussian process prior. However, no analytical conditional posteriors are provided and Gibbs
sampling is not considered as a viable option [12].

3.2 Augment-and-conquer inference for joint count and mixture modeling 

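For a finite continuous base measure, the gamma process $G \sim \mathrm{GaP}(c, G_0)$ can also be
defined with its Lévy measure on a product space $\mathbb{R}_+ \times \Omega$, expressed as
$\nu(dr\,d\omega) = r^{-1} e^{-cr}\,dr\,G_0(d\omega)$ [9]. Since the Poisson intensity
$\nu^+ = \nu(\mathbb{R}_+ \times \Omega) = \infty$ and
$\int\!\!\int_{\mathbb{R}_+ \times \Omega} r\,\nu(dr\,d\omega)$ is finite, a draw from this process
can be expressed as $G = \sum_{k=1}^{\infty} r_k \delta_{\omega_k}$,
$(r_k, \omega_k) \sim \pi(dr\,d\omega)$, $\pi(dr\,d\omega)\nu^+ \equiv \nu(dr\,d\omega)$ [9].
Here we consider a discrete base measure as
$G_0 = \sum_{k=1}^{K} \frac{\gamma_0}{K} \delta_{\omega_k}$, $\omega_k \sim g_0(\omega_k)$, then we
have $G = \sum_{k=1}^{K} r_k \delta_{\omega_k}$, $r_k \sim \mathrm{Gamma}(\gamma_0/K, 1/c)$,
$\omega_k \sim g_0(\omega_k)$, which becomes a draw from the gamma process with a continuous base
measure as $K \to \infty$. Let $x_{ji} \sim F(\omega_{z_{ji}})$ be observation $i$ in group $j$,
linked to a mixture component $\omega_{z_{ji}} \in \Omega$ through a distribution $F$. Denote
$n_{jk} = \sum_{i=1}^{N_j} \delta(z_{ji} = k)$, we can express the gamma-NB process with the
discrete base measure as

$$\omega_k \sim g_0(\omega_k), \quad N_j = \sum_{k=1}^{K} n_{jk}, \quad n_{jk} \sim \mathrm{Pois}(\lambda_{jk}), \quad \lambda_{jk} \sim \mathrm{Gamma}(r_k, p_j/(1-p_j)),$$
$$r_k \sim \mathrm{Gamma}(\gamma_0/K, 1/c), \quad p_j \sim \mathrm{Beta}(a_0, b_0), \quad \gamma_0 \sim \mathrm{Gamma}(e_0, 1/f_0) \tag{16}$$

where marginally we have $n_{jk} \sim \mathrm{NB}(r_k, p_j)$. Using the equivalence between (1)
and (2), we can equivalently express $N_j$ and $n_{jk}$ in the above model as
$N_j \sim \mathrm{Pois}(\lambda_j)$,
$[n_{j1}, \ldots, n_{jK}] \sim \mathrm{Mult}(N_j; \lambda_{j1}/\lambda_j, \ldots, \lambda_{jK}/\lambda_j)$,
where $\lambda_j = \sum_{k=1}^{K} \lambda_{jk}$. Since the data $\{x_{ji}\}_{i=1,N_j}$ are fully
exchangeable, rather than drawing $[n_{j1}, \ldots, n_{jK}]$ once, we may equivalently draw the
index

$$z_{ji} \sim \mathrm{Discrete}(\lambda_{j1}/\lambda_j, \ldots, \lambda_{jK}/\lambda_j) \tag{17}$$

for each $x_{ji}$ and then let $n_{jk} = \sum_{i=1}^{N_j} \delta(z_{ji} = k)$. This provides
further insights on how the seemingly disjoint count and mixture modeling problems are seamlessly
united under the NB process framework. Following (8)-(12), the block Gibbs sampling is
straightforward to write as

$$p(\omega_k \mid -) \propto \prod_{z_{ji} = k} F(x_{ji}; \omega_k)\, g_0(\omega_k), \quad \Pr(z_{ji} = k \mid -) \propto F(x_{ji}; \omega_k) \lambda_{jk},$$
$$(p_j \mid -) \sim \mathrm{Beta}\Big(a_0 + N_j, b_0 + \sum_k r_k\Big), \quad p' = \frac{-\sum_j \ln(1-p_j)}{c - \sum_j \ln(1-p_j)}, \quad (l_{jk} \mid -) \sim \mathrm{CRT}(n_{jk}, r_k),$$
$$(l'_k \mid -) \sim \mathrm{CRT}\Big(\sum_j l_{jk}, \gamma_0/K\Big), \quad (\gamma_0 \mid -) \sim \mathrm{Gamma}\Big(e_0 + \sum_k l'_k, \frac{1}{f_0 - \ln(1-p')}\Big),$$
$$(r_k \mid -) \sim \mathrm{Gamma}\Big(\gamma_0/K + \sum_j l_{jk}, \frac{1}{c - \sum_j \ln(1-p_j)}\Big), \quad (\lambda_{jk} \mid -) \sim \mathrm{Gamma}(r_k + n_{jk}, p_j). \tag{18}$$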

which has similar computational complexity as that of the direct assignment block Gibbs sampling
of the CRF-HDP [7, 26]. If $g_0(\omega)$ is conjugate to the likelihood $F(x; \omega)$, then the
posterior $p(\omega \mid -)$ would be analytical. Note that when $K \to \infty$, we have
$(l'_k \mid -) = \delta(\sum_j l_{jk} > 0) = \delta(\sum_j n_{jk} > 0)$.
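The count-side updates in (18) can be sketched as a single sweep. This is a minimal illustration assuming NumPy, not the implementation used in the experiments: the topic counts `n[j, k]` are held fixed (the updates of $\omega_k$ and $z_{ji}$, which depend on the likelihood $F$, are omitted), and the function names, the clipping guard on `p`, and the toy inputs are all assumptions.

```python
import numpy as np

def crt(m, r, rng):
    """l ~ CRT(m, r) as a sum of Bernoulli(r / (n + r)) for n = 0..m-1."""
    if m == 0:
        return 0
    n = np.arange(m)
    return int((rng.random(m) < r / (n + r)).sum())

def gibbs_sweep(n, r, p, gamma0, hyper, rng):
    """One pass over the count-side conditionals in (18), given counts n[j, k]."""
    c, a0, b0, e0, f0 = hyper
    J, K = n.shape
    # (p_j | -) ~ Beta(a0 + N_j, b0 + sum_k r_k); clipped away from 0 and 1 (a guard).
    p = np.clip(rng.beta(a0 + n.sum(axis=1), b0 + r.sum()), 1e-12, 1 - 1e-12)
    sum_log1m = np.log1p(-p).sum()                 # sum_j ln(1 - p_j)
    p_prime = -sum_log1m / (c - sum_log1m)
    # (l_jk | -) ~ CRT(n_jk, r_k)
    l = np.array([[crt(n[j, k], r[k], rng) for k in range(K)] for j in range(J)])
    # (l'_k | -) ~ CRT(sum_j l_jk, gamma0 / K)
    l_prime = np.array([crt(l[:, k].sum(), gamma0 / K, rng) for k in range(K)])
    # (gamma0 | -) ~ Gamma(e0 + sum_k l'_k, 1 / (f0 - ln(1 - p')))
    gamma0 = rng.gamma(e0 + l_prime.sum(), 1.0 / (f0 - np.log1p(-p_prime)))
    # (r_k | -) ~ Gamma(gamma0 / K + sum_j l_jk, 1 / (c - sum_j ln(1 - p_j)))
    r = rng.gamma(gamma0 / K + l.sum(axis=0), 1.0 / (c - sum_log1m))
    # (lambda_jk | -) ~ Gamma(r_k + n_jk, p_j)
    lam = rng.gamma(r[None, :] + n, p[:, None])
    return r, p, gamma0, lam

rng = np.random.default_rng(0)
n = rng.poisson(2.0, size=(5, 4))        # hypothetical topic counts n_jk (J=5, K=4)
r, p, gamma0 = np.ones(4), np.full(5, 0.5), 1.0
hyper = (1.0, 0.01, 0.01, 0.01, 0.01)    # c, a0, b0, e0, f0
for _ in range(10):
    r, p, gamma0, lam = gibbs_sweep(n, r, p, gamma0, hyper, rng)
```

Every step is a draw from a standard distribution, which is the practical meaning of "analytical conditional posteriors" here; in a full sampler the sweep would alternate with redrawing $z_{ji}$ from $\Pr(z_{ji} = k \mid -) \propto F(x_{ji}; \omega_k)\lambda_{jk}$ and recomputing `n`.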

Using (1) and (2) and normalizing the gamma distributions, (16) can be re-expressed as

$$z_{ji} \sim \mathrm{Discrete}(\tilde{\lambda}_j), \quad \tilde{\lambda}_j \sim \mathrm{Dir}(\alpha \tilde{r}), \quad \alpha \sim \mathrm{Gamma}(\gamma_0, 1/c), \quad \tilde{r} \sim \mathrm{Dir}(\gamma_0/K, \ldots, \gamma_0/K) \tag{19}$$

which loses the count modeling ability and becomes a finite representation of the HDP, the
inference of which is not conjugate and has to be solved under alternative representations
[7, 26]. This also implies that by using the Dirichlet process as the foundation, traditional
mixture modeling may discard useful count information from the beginning.

4 The Negative Binomial Process Family and Related Algorithms 

The gamma-NB process shares the NB dispersion across groups. Since the NB distribution has two
adjustable parameters, we may explore alternative ideas, with the NB probability measure shared
across groups as in [6], or with both the dispersion and probability measures shared as in [5].
These constructions are distinct from both the gamma-NB process and HDP in that $\Lambda_j$ has
space dependent scales, and thus its normalization
$\tilde{\Lambda}_j = \Lambda_j/\Lambda_j(\Omega)$ no longer follows a Dirichlet process.

It is natural to let the probability measure be drawn from a beta process [27, 28], which can be
defined by its Lévy measure on a product space $[0, 1] \times \Omega$ as
$\nu(dp\,d\omega) = c p^{-1} (1-p)^{c-1}\,dp\,B_0(d\omega)$. A draw from the beta process
$B \sim \mathrm{BP}(c, B_0)$ with concentration parameter $c$ and base measure $B_0$ can be
expressed as $B = \sum_{k=1}^{\infty} p_k \delta_{\omega_k}$. A beta-NB process [5, 6] can be
constructed by letting $X_j \sim \mathrm{NBP}(r_j, B)$, with a random draw expressed as
$X_j = \sum_{k=1}^{\infty} n_{jk} \delta_{\omega_k}$, $n_{jk} \sim \mathrm{NB}(r_j, p_k)$.
Under this construction, the NB probability measure is shared and the NB dispersion parameters
are group dependent. As in [5], we may also consider a marked-beta-NB process^1 in which both the
NB probability and dispersion measures are shared, and in which each point of the beta process is
marked with an independent gamma random variable. Thus a draw from the marked-beta process
becomes $(R, B) = \sum_{k=1}^{\infty} (r_k, p_k) \delta_{\omega_k}$, and the NB process
$X_j \sim \mathrm{NBP}(R, B)$ becomes $X_j = \sum_{k=1}^{\infty} n_{jk} \delta_{\omega_k}$,
$n_{jk} \sim \mathrm{NB}(r_k, p_k)$. Since the beta and NB processes are conjugate, the
posterior of $B$ is tractable, as shown in [5, 6]. If it is believed that there is an excessive
number of zeros governed by a separate process other than the NB process, we may introduce
a zero inflated NB process as $X_j \sim \mathrm{NBP}(R Z_j, p_j)$, where
$Z_j \sim \mathrm{BeP}(B)$ is drawn from the Bernoulli process [28] and
$(R, B) = \sum_{k=1}^{\infty} (r_k, \pi_k) \delta_{\omega_k}$ is drawn from a marked-beta process,
thus $n_{jk} \sim \mathrm{NB}(r_k b_{jk}, p_j)$, $b_{jk} \sim \mathrm{Bernoulli}(\pi_k)$. This
construction can be linked to the model in [29] with appropriate normalization, with the
advantages that there is no need to fix $p_j = 0.5$ and the inference is fully tractable. The zero
inflated construction can also be linked to models for real valued data using the Indian buffet
process (IBP) or beta-Bernoulli process spike-and-slab prior [30, 31, 32, 33].

4.1 Related Algorithms 

To show how the NB processes can be diversely constructed and to make connections to previous
parametric and nonparametric mixture models, we show in Table 1 a variety of NB processes, which
differ on how the dispersion and probability measures are shared. For a deeper understanding of
how the counts are modeled, we also show in Table 1 both the VMR and ODL implied by these
settings. We consider topic modeling of a document corpus, a typical example of mixture modeling
of grouped data, where each bag-of-words document constitutes a group, each word is an
exchangeable group member, and $F(x_{ji}; \omega_k)$ is simply the probability of word $x_{ji}$
in topic $\omega_k$.

Table 1: A variety of negative binomial processes are constructed with distinct sharing
mechanisms, reflected by which parameters from $r_k$, $r_j$, $p_k$, $p_j$ and $\pi_k$ ($b_{jk}$)
are inferred (indicated by a check-mark ✓), and the implied VMR and ODL for counts
$\{n_{jk}\}_{j,k}$. They are applied to topic modeling of a document corpus, a typical example of
mixture modeling of grouped data. Related algorithms are shown in the last column.

Algorithms     | r_k | r_j | p_k | p_j | π_k | VMR            | ODL               | Related Algorithms
---------------|-----|-----|-----|-----|-----|----------------|-------------------|--------------------------
NB-LDA         |     | ✓   |     | ✓   |     | (1-p_j)^{-1}   | r_j^{-1}          | LDA [34], Dir-PFA [5]
NB-HDP         | ✓   |     |     | 0.5 |     | 2              | r_k^{-1}          | HDP [7], DILN-HDP [12]
NB-FTM         | ✓   |     |     | 0.5 | ✓   | 2              | (r_k)^{-1} b_jk   | FTM [29], SγΓ-PFA [5]
Beta-NB        |     | ✓   | ✓   |     |     | (1-p_k)^{-1}   | r_j^{-1}          | BNBP [5], BNBP [6]
Gamma-NB       | ✓   |     |     | ✓   |     | (1-p_j)^{-1}   | r_k^{-1}          | CRF-HDP [7, 26]
Marked-Beta-NB | ✓   |     | ✓   |     |     | (1-p_k)^{-1}   | r_k^{-1}          | BNBP [5]

^1 We may also consider a beta marked-gamma-NB process, whose performance is found to be very similar.

We consider six differently constructed NB processes in Table 1. (i) Related to latent Dirichlet
allocation (LDA) [34] and Dirichlet Poisson factor analysis (Dir-PFA) [5], the NB-LDA is also a
parametric topic model that requires tuning the number of topics. However, it uses a document
dependent $r_j$ and $p_j$ to automatically learn the smoothing of the gamma distributed topic
weights, and it lets $r_j \sim \mathrm{Gamma}(\gamma_0, 1/c)$,
$\gamma_0 \sim \mathrm{Gamma}(e_0, 1/f_0)$ to share statistical strength between documents, with
closed-form Gibbs sampling inference. Thus even the most basic parametric LDA topic model can be
improved under the NB count modeling framework. (ii) The NB-HDP model is related to the HDP [7],
and since $p_j$ is an irrelevant parameter in the HDP due to normalization, we set it in the
NB-HDP as 0.5, the usually perceived value before normalization. The NB-HDP model is comparable to
the DILN-HDP [12] that constructs the group-level DPs with normalized gamma processes, whose scale
parameters are also set as one. (iii) The NB-FTM model introduces an additional beta-Bernoulli
process under the NB process framework to explicitly model zero counts. It is the same as the
sparse-gamma-gamma-PFA (SγΓ-PFA) in [5] and is comparable to the focused topic model (FTM) [29],
which is constructed from the IBP compound DP. Nevertheless, they apply about the same likelihoods
and priors for inference. The Zero-Inflated-NB process improves over them by allowing $p_j$ to be
inferred, which generally yields better data fitting. (iv) The Gamma-NB process explores the idea
that the dispersion measure is shared across groups, and it improves over the NB-HDP by allowing
the learning of $p_j$. It reduces to the HDP [7] by normalizing both the group-level and the
shared gamma processes. (v) The Beta-NB process explores the idea that the probability measure is
shared across groups, and it improves over the beta negative binomial process (BNBP) proposed in
[6] by allowing the inference of $r_j$. (vi) The Marked-Beta-NB process is comparable to the BNBP
proposed in [5], with the distinction that it allows analytical update of $r_k$. The constructions
and inference of the various NB processes and related algorithms in Table 1 all follow the
formulas in (16) and (18), respectively, with additional details presented in the supplementary
material.



Note that as shown in [5], NB process topic models can also be considered as factor analysis of
the term-document count matrix under the Poisson likelihood, with $\omega_k$ as the $k$th factor
loading that sums to one and $\lambda_{jk}$ as the factor score, which can be further linked to
nonnegative matrix factorization [35] and a gamma Poisson factor model [36]. If, besides the
proportions $\tilde{\lambda}_j$ and $\tilde{r}$, the absolute values, e.g., $\lambda_{jk}$, $r_k$
and $p_k$, are also of interest, then the NB process based joint count and mixture models would be
more appropriate than the HDP based mixture models.

5 Example Results 

Motivated by Table 1, we consider topic modeling using a variety of NB processes, which differ on
which parameters to learn and consequently how the VMR and ODL of the latent counts
$\{n_{jk}\}_{j,k}$ are modeled. We compare them with LDA [34, 37] and CRF-HDP [7, 26]. For fair
comparison, they are all implemented with block Gibbs sampling using a discrete base measure with
$K$ atoms, and for the first fifty iterations, the Gamma-NB process with $r_k \equiv 50/K$ and
$p_j \equiv 0.5$ is used for initialization. For LDA and NB-LDA, we search $K$ for optimal
performance and for the other models, we set $K = 400$ as an upper bound. We set the parameters as
$c = 1$, $\eta = 0.05$ and $a_0 = b_0 = e_0 = f_0 = 0.01$. For LDA, we set the topic proportion
Dirichlet smoothing parameter as $50/K$, following the topic model toolbox^2 provided for [37]. We
consider 2500 Gibbs sampling iterations, with the last 1500 samples collected. Under the NB
processes, each word $x_{ji}$ would be assigned to a topic $k$ based on both
$F(x_{ji}; \omega_k)$ and the topic weights $\{\lambda_{jk}\}_{k=1,K}$; each topic is drawn from a
Dirichlet base measure as $\omega_k \sim \mathrm{Dir}(\eta, \ldots, \eta) \in \mathbb{R}^V$, where
$V$ is the number of unique terms in the vocabulary and $\eta$ is a smoothing parameter. Let
$v_{ji}$ denote the location of word $x_{ji}$ in the vocabulary, then we have
$(\omega_k \mid -) \sim \mathrm{Dir}\big(\eta + \sum_j \sum_i \delta(z_{ji} = k, v_{ji} = 1), \ldots, \eta + \sum_j \sum_i \delta(z_{ji} = k, v_{ji} = V)\big)$.
We consider the Psychological Review^2 corpus, restricting the vocabulary to terms that occur in
five or more documents. The corpus includes 1281 abstracts from 1967 to 2003, with 2,566 unique
terms and 71,279 total word counts. We randomly select 20%, 40%, 60% or 80% of the words from each
document to learn a document dependent probability for each term $v$ as

$$f_{jv} = \frac{\sum_{s=1}^{S} \sum_{k=1}^{K} \omega_{vk}^{(s)} \lambda_{jk}^{(s)}}{\sum_{s=1}^{S} \sum_{v=1}^{V} \sum_{k=1}^{K} \omega_{vk}^{(s)} \lambda_{jk}^{(s)}},$$

where $\omega_{vk}$ is the probability of term $v$ in topic $k$ and $S$ is the total number of
collected samples. We use $\{f_{jv}\}_{j,v}$ to calculate the per-word perplexity on the held-out
words as in [5]. The final results are averaged from five random training/testing partitions. The
performance measure is also similar to those used in [38, 39, 25]. Note that the perplexity per
test word is the fair metric to compare topic models. However, when the actual Poisson rates or
distribution parameters for counts, instead of the mixture proportions, are of interest, a NB
process based joint count and mixture model would be more appropriate than an HDP based mixture
model.

Figure 1: Comparison of per-word perplexities on the held-out words between various algorithms.
(a) With 60% of the words in each document used for training, the performance varies as a function
of $K$ in both LDA and NB-LDA, which are parametric models, whereas the NB-HDP, NB-FTM, Beta-NB,
CRF-HDP, Gamma-NB and Marked-Beta-NB all infer the number of active topics, which are 127, 201,
107, 161, 177 and 130, respectively, according to the last Gibbs sampling iteration. (b) Per-word
perplexities of various models as a function of the percentage of words in each document used for
training. The results of the LDA and NB-LDA are shown with the best settings of $K$ under each
training/testing partition.
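The term probabilities $f_{jv}$ and the resulting perplexity can be computed as sketched below, assuming NumPy. The text defers the exact perplexity formula to [5]; the common definition $\exp\big(-\sum_{j,v} y_{jv} \ln f_{jv} / \sum_{j,v} y_{jv}\big)$ for held-out counts $y_{jv}$ is assumed here, and the array layouts and toy inputs are hypothetical.

```python
import numpy as np

def term_probabilities(omega_samples, lam_samples):
    """f[j, v] from S collected samples.

    omega_samples: (S, V, K) topic-term probabilities omega_vk per sample.
    lam_samples:   (S, J, K) topic weights lambda_jk per sample.
    """
    # Sum over samples s and topics k of omega[v, k] * lambda[j, k].
    rate = np.einsum("svk,sjk->jv", omega_samples, lam_samples)
    return rate / rate.sum(axis=1, keepdims=True)

def per_word_perplexity(f, heldout_counts):
    """Assumed definition: exp(-sum_{j,v} y_jv * ln f_jv / sum_{j,v} y_jv)."""
    mask = heldout_counts > 0
    log_lik = (heldout_counts[mask] * np.log(f[mask])).sum()
    return float(np.exp(-log_lik / heldout_counts.sum()))

# Toy check: uniform topics over V = 6 terms give perplexity equal to V.
S, V, K, J = 2, 6, 3, 4
omega_samples = np.full((S, V, K), 1.0 / V)
lam_samples = np.ones((S, J, K))
heldout = np.arange(J * V).reshape(J, V) % 3   # hypothetical held-out counts
f = term_probabilities(omega_samples, lam_samples)
ppl = per_word_perplexity(f, heldout)
```

With uniform term probabilities the perplexity equals the vocabulary size, which gives a quick sanity check on any implementation of the metric.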

Figure 1 compares the performance of various algorithms. The Marked-Beta-NB process has the
best performance, closely followed by the Gamma-NB process, CRF-HDP and Beta-NB process.
With an appropriate $K$, the parametric NB-LDA may outperform the nonparametric NB-HDP and
NB-FTM as the training data percentage increases, a somewhat unexpected but very intuitive result,
showing that even by learning both the NB dispersion and probability parameters $r_j$ and $p_j$ in
a document dependent manner, we may get better data fitting than using nonparametric models that
share the NB dispersion parameters $r_k$ across documents, but fix the NB probability parameters.

Figure [2] shows the learned model parameters by various algorithms under the NB process frame- 
work, revealing distinct sharing mechanism and model properties. When (rj,Pj) is used, as in the 
NB-LDA, different documents are weekly coupled with rj ~ Gamma(7o, 1/c), and the modeling 
results show that a typical document in this corpus usually has a small rj and a large pj, thus a large 
ODL and a large VMR, indicating highly overdispersed counts on its topic usage. When (rj,pk) is 
used to model the latent counts $\{n_{jk}\}_{j,k}$, as in the Beta-NB process, the transition between
active and non-active topics is very sharp: $p_k$ is either close to one or close to zero. This is
because $p_k$ controls the mean as $\mathbb{E}[\sum_j n_{jk}] = p_k/(1-p_k) \sum_j r_j$ and the VMR as
$(1-p_k)^{-1}$ on topic $k$, so a popular topic must also have a large $p_k$ and thus large
overdispersion as measured by the VMR. Since the counts $\{n_{jk}\}_j$ are usually overdispersed,
particularly in this corpus, a middle-range $p_k$, indicating an appreciable mean and small
overdispersion, is not favored by the model and is thus rarely observed. When $(r_k, p_j)$ is used, as
in the Gamma-NB process, the transition is much smoother, with $r_k$ gradually decreasing. The reason
is that $r_k$ controls the mean as $\mathbb{E}[\sum_j n_{jk}] = r_k \sum_j p_j/(1-p_j)$ and the ODL as
$r_k^{-1}$ on topic $k$, so popular topics must also have large $r_k$ and thus small overdispersion as
measured by the ODL, while unpopular topics are modeled with small $r_k$ and thus large
overdispersion, allowing rarely and lightly used topics. Therefore, we can expect that $(r_k, p_j)$
would allow more topics than $(r_j, p_k)$, as confirmed in Figure 1(a), where the Gamma-NB process
learns 177 active topics, significantly more than the 107 of the Beta-NB process. From this analysis,
we can conclude that the mean and the amount of overdispersion (measured by the VMR or ODL) for
the usage of topic $k$ are positively correlated under $(r_j, p_k)$ and negatively correlated under
$(r_k, p_j)$.

http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm

Figure 2: Distinct sharing mechanisms and modeling properties are evident between various NB processes,
by comparing their inferred parameters. Note that the transition between active and non-active topics is very
sharp when $p_k$ is used and much smoother when $r_k$ is used. Both the documents and topics are ordered in
decreasing order based on the number of words associated with each of them. These results are based on the
last Gibbs sampling iteration. The values are shown in either linear or log scales for convenient visualization.
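The coupling between mean and overdispersion discussed above can be checked numerically. The following is a minimal sketch, assuming the parameterization $n \sim \mathrm{NB}(r, p)$ with mean $rp/(1-p)$ and variance $rp/(1-p)^2$; the helper name `nb_summary` and the chosen parameter grids are illustrative and not part of the paper.

```python
def nb_summary(r, p):
    """Moments of n ~ NB(r, p), assuming mean r*p/(1-p) and variance r*p/(1-p)**2."""
    mean = r * p / (1.0 - p)
    var = r * p / (1.0 - p) ** 2
    vmr = var / mean                  # variance-to-mean ratio, equals 1/(1-p)
    odl = (var - mean) / mean ** 2    # overdispersion level, equals 1/r
    return {"mean": mean, "var": var, "vmr": vmr, "odl": odl}


# Topic-specific p_k with a fixed r (Beta-NB style): mean and VMR grow together.
beta_nb_like = [nb_summary(1.0, p) for p in (0.1, 0.5, 0.9)]

# Topic-specific r_k with a fixed p (Gamma-NB style): mean grows while ODL shrinks.
gamma_nb_like = [nb_summary(r, 0.5) for r in (0.1, 1.0, 10.0)]

for row in beta_nb_like + gamma_nb_like:
    print(row)
```

Under these assumptions, varying only $p$ moves the mean and the VMR in the same direction, whereas varying only $r$ moves the mean and the ODL in opposite directions.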

When $(r_k, p_k)$ is used, as in the Marked-Beta-NB process, more diverse combinations of mean and
overdispersion are allowed, as both $r_k$ and $p_k$ are now responsible for the mean
$\mathbb{E}[\sum_j n_{jk}] = J r_k p_k/(1-p_k)$. For example, there could be not only a large mean with
small overdispersion (large $r_k$ and small $p_k$), but also a large mean with large overdispersion
(small $r_k$ and large $p_k$). Thus $(r_k, p_k)$ may combine the advantages of using only $r_k$ or only
$p_k$ to model topic $k$, as confirmed by the superior performance of the Marked-Beta-NB over the
Beta-NB and Gamma-NB processes. When $(r_k, \pi_k)$ is used, as in the NB-FTM model, our results show
that we usually have a small $\pi_k$ and a large $r_k$, indicating that topic $k$ is sparsely used
across the documents but that, once it is used, the amount of variation in usage is small. This
modeling property might be helpful when there is an excessive number of zeros, which might not be
well modeled by the NB process alone. In our experiments, we find that the more direct approaches of
using $p_k$ or $p_j$ generally yield better results, but this might not be the case when an excessive
number of zeros is better explained with the underlying beta-Bernoulli or IBP processes, e.g., when
the training words are scarce.

It is also interesting to compare the Gamma-NB and NB-HDP. From a mixture modeling viewpoint,
fixing $p_j = 0.5$ is a natural choice, as $p_j$ becomes irrelevant after normalization. However, from
a count modeling viewpoint, this makes the restrictive assumption that each count vector
$\{n_{jk}\}_{k=1,K}$ has the same VMR of 2, and the experimental results in Figure 1 confirm
the importance of learning $p_j$ together with $r_k$. It is also interesting to examine (15), which can
be viewed as the concentration parameter $\alpha$ in the HDP; allowing the adjustment of $p_j$ permits a
more flexible model assumption on the amount of variation between the topic proportions, and thus
potentially better data fitting.
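The claim that $p_j$ becomes irrelevant after normalization can be illustrated with a small simulation. This is a sketch under the assumption that $\lambda_{jk} \sim \mathrm{Gamma}(r_k, p_j/(1-p_j))$ uses a shape-scale parameterization; the function name `topic_proportions` and the coupling of draws through a shared seed are illustrative choices, not the authors' code.

```python
import random


def topic_proportions(r, p_j, seed):
    """Draw lambda_jk = scale * Gamma(r_k, 1) with scale p_j/(1-p_j), then normalize."""
    rng = random.Random(seed)
    scale = p_j / (1.0 - p_j)
    lam = [scale * rng.gammavariate(r_k, 1.0) for r_k in r]
    total = sum(lam)
    return [x / total for x in lam], total


r = [2.0, 1.0, 0.5, 0.1]  # hypothetical dispersion parameters r_k
props_a, total_a = topic_proportions(r, 0.5, seed=7)
props_b, total_b = topic_proportions(r, 0.9, seed=7)

# The normalized proportions agree; only the total rate changes with p_j.
print(props_a)
print(props_b)
print(total_b / total_a)
```

With the same underlying gamma draws, changing $p_j$ rescales every $\lambda_{jk}$ by the same factor, so the proportions are unchanged while the expected total count is not. This is why $p_j$ matters for count modeling even though it drops out of the mixture weights.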



6 Conclusions 

We propose a variety of negative binomial (NB) processes to jointly model counts across groups,
which can be naturally applied for mixture modeling of grouped data. The proposed NB processes
are completely random measures, in that they assign independent random variables to disjoint Borel
sets of the measure space, as opposed to the hierarchical Dirichlet process (HDP), whose measures on
disjoint Borel sets are negatively correlated. We discover augment-and-conquer inference methods:
by "augmenting" an NB process into both the gamma-Poisson and compound Poisson representations,
we are able to "conquer" the unification of count and mixture modeling, the analysis of
fundamental model properties, and the derivation of efficient Gibbs sampling inference. We
demonstrate that the gamma-NB process, which shares the NB dispersion measure across groups, can be
normalized to produce the HDP, and we show in detail its theoretical, structural and computational
advantages over the HDP. We examine the distinct sharing mechanisms and model properties of
various NB processes, with connections to existing algorithms, and our experimental results on topic
modeling show the importance of modeling both the NB dispersion and probability parameters.

A Generating a CRT random variable

Lemma A.1. A CRT random variable $l \sim \mathrm{CRT}(m, r)$ can be generated with the summation of
independent Bernoulli random variables as

$$
l = \sum_{n=1}^{m} b_n, \quad b_n \sim \mathrm{Bernoulli}\left(\frac{r}{n-1+r}\right). \tag{20}
$$

Proof. Since $l$ is the summation of independent Bernoulli random variables, its PGF becomes

$$
C_L(z) = \prod_{n=1}^{m}\left(\frac{n-1}{n-1+r} + \frac{r}{n-1+r}\,z\right)
= \frac{\Gamma(r)}{\Gamma(m+r)}\sum_{k=0}^{m}|s(m,k)|(rz)^k.
$$

Thus we have $f_L(l|m,r) = \frac{C_L^{(l)}(0)}{l!} = \frac{\Gamma(r)}{\Gamma(m+r)}|s(m,l)|r^l$, $l = 0, 1, \dots, m$.

□
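Lemma A.1 translates directly into a sampler. The sketch below draws a CRT variable as a Bernoulli sum and, as a sanity check, compares the exact distribution of that sum with the closed-form pmf from the proof. The function names are illustrative assumptions, and $|s(m,l)|$ is computed with the standard recursion for unsigned Stirling numbers of the first kind.

```python
import math
import random


def sample_crt(m, r, rng=random):
    """Draw l ~ CRT(m, r) as a sum of independent Bernoulli(r / (n - 1 + r)), n = 1..m."""
    return sum(1 for n in range(1, m + 1) if rng.random() < r / (n - 1 + r))


def crt_pmf_from_bernoullis(m, r):
    """Exact pmf of the Bernoulli sum, built by convolving one Bernoulli at a time."""
    pmf = [1.0]
    for n in range(1, m + 1):
        q = r / (n - 1 + r)
        new = [0.0] * (len(pmf) + 1)
        for l, mass in enumerate(pmf):
            new[l] += mass * (1.0 - q)
            new[l + 1] += mass * q
        pmf = new
    return pmf


def crt_pmf_closed_form(m, r):
    """pmf Gamma(r)/Gamma(m+r) * |s(m,l)| * r**l for l = 0..m."""
    s = [1]  # row n = 0 of the unsigned Stirling numbers of the first kind
    for n in range(1, m + 1):
        # |s(n,k)| = (n-1)|s(n-1,k)| + |s(n-1,k-1)|
        s = [(n - 1) * (s[k] if k < len(s) else 0) + (s[k - 1] if k >= 1 else 0)
             for k in range(n + 1)]
    norm = math.exp(math.lgamma(r) - math.lgamma(m + r))
    return [norm * s[l] * r ** l for l in range(m + 1)]


print(sample_crt(20, 1.5, random.Random(0)))
print(crt_pmf_from_bernoullis(4, 1.5))
print(crt_pmf_closed_form(4, 1.5))
```

The sampler costs $O(m)$ uniform draws per variable, which is usually acceptable because $m$ is a per-document, per-topic count.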



B Dir-PFA and LDA 

The Dirichlet Poisson factor analysis (Dir-PFA) model [5] is constructed as 

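$$
\begin{aligned}
x_{ji} &\sim F(\omega_{z_{ji}}), \quad \omega_k \sim \mathrm{Dir}(\eta, \dots, \eta) \\
N_j &= \sum_{k=1}^{K} n_{jk}, \quad n_{jk} \sim \mathrm{Pois}(\lambda_{jk}), \quad \lambda_j \sim \mathrm{Dir}(50/K, \dots, 50/K)
\end{aligned}
\tag{21}
$$

where $\eta$ is the Dirichlet smoothing parameter for the topic's distribution over the vocabulary,
$n_{jk} = \sum_{i=1}^{N_j} \delta(z_{ji} = k)$, and the data likelihood $F(x_{ji}; \omega_k)$ in topic modeling is
$\omega_{v_{ji}k}$, the probability of the $i$th word in the $j$th document under topic $k$.

The Dir-PFA has the same block Gibbs sampling as LDA [34], expressed as

$$
\begin{aligned}
\Pr(z_{ji} = k|-) &\propto F(x_{ji}; \omega_k)\lambda_{jk} \\
(\omega_k|-) &\sim \mathrm{Dir}\left(\eta + \sum_{j=1}^{J}\sum_{i=1}^{N_j}\delta(z_{ji} = k, v_{ji} = 1), \dots, \eta + \sum_{j=1}^{J}\sum_{i=1}^{N_j}\delta(z_{ji} = k, v_{ji} = V)\right) \\
(\lambda_j|-) &\sim \mathrm{Dir}\left(50/K + n_{j1}, \dots, 50/K + n_{jK}\right).
\end{aligned}
\tag{22}
$$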
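One sweep of the block Gibbs sampler in (22) can be sketched as follows. This is a minimal illustration, assuming documents are lists of word indices in `0..V-1` and using gamma normalization to draw Dirichlet vectors; the names `sample_dirichlet` and `gibbs_sweep` and the data layout are assumptions made for this example, not the authors' implementation.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) variables."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]


def gibbs_sweep(docs, z, omega, lam, eta, V, K, rng, mass=50.0):
    """One pass over (22). docs[j][i] is a word index; z[j][i] is its topic index."""
    # Pr(z_ji = k | -) is proportional to omega_k[v_ji] * lambda_jk.
    for j, doc in enumerate(docs):
        for i, v in enumerate(doc):
            weights = [omega[k][v] * lam[j][k] for k in range(K)]
            z[j][i] = rng.choices(range(K), weights=weights)[0]
    # Collect topic-word and document-topic counts from the new assignments.
    topic_word = [[0] * V for _ in range(K)]
    doc_topic = [[0] * K for _ in docs]
    for j, doc in enumerate(docs):
        for i, v in enumerate(doc):
            topic_word[z[j][i]][v] += 1
            doc_topic[j][z[j][i]] += 1
    # Conjugate Dirichlet updates for omega_k and lambda_j.
    omega = [sample_dirichlet([eta + c for c in topic_word[k]], rng) for k in range(K)]
    lam = [sample_dirichlet([mass / K + c for c in doc_topic[j]], rng)
           for j in range(len(docs))]
    return z, omega, lam
```

A caller would initialize `omega` and `lam` from their priors, then call `gibbs_sweep` repeatedly, discarding early sweeps as burn-in.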
C CRF-HDP 

The CRF-HDP model [7, 26] is constructed as 

$$
\begin{aligned}
x_{ji} &\sim F(\omega_{z_{ji}}), \quad \omega_k \sim \mathrm{Dir}(\eta, \dots, \eta), \quad z_{ji} \sim \mathrm{Discrete}(\lambda_j) \\
\lambda_j &\sim \mathrm{Dir}(\alpha\tilde{r}), \quad \alpha \sim \mathrm{Gamma}(a_0, 1/b_0), \quad \tilde{r} \sim \mathrm{Dir}(\gamma_0/K, \dots, \gamma_0/K).
\end{aligned}
\tag{23}
$$

Under the CRF metaphor, denoting $n_{jk}$ as the number of customers eating dish $k$ in restaurant $j$ and
$l_{jk}$ as the number of tables serving dish $k$ in restaurant $j$, the direct assignment block Gibbs sampling
can be expressed as



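$$
\begin{aligned}
\Pr(z_{ji} = k|-) &\propto F(x_{ji}; \omega_k)\lambda_{jk} \\
(l_{jk}|-) &\sim \mathrm{CRT}(n_{jk}, \alpha\tilde{r}_k), \quad w_j \sim \mathrm{Beta}(\alpha + 1, N_j), \quad s_j \sim \mathrm{Bernoulli}\left(\frac{N_j}{N_j + \alpha}\right) \\
\alpha &\sim \mathrm{Gamma}\left(a_0 + \sum_{j=1}^{J}\sum_{k=1}^{K} l_{jk} - \sum_{j=1}^{J} s_j,\; \frac{1}{b_0 - \sum_{j} \ln w_j}\right) \\
(\tilde{r}|-) &\sim \mathrm{Dir}\left(\gamma_0/K + \sum_{j=1}^{J} l_{j1}, \dots, \gamma_0/K + \sum_{j=1}^{J} l_{jK}\right) \\
(\lambda_j|-) &\sim \mathrm{Dir}\left(\alpha\tilde{r}_1 + n_{j1}, \dots, \alpha\tilde{r}_K + n_{jK}\right) \\
(\omega_k|-) &\sim \mathrm{Dir}\left(\eta + \sum_{j=1}^{J}\sum_{i=1}^{N_j}\delta(z_{ji} = k, v_{ji} = 1), \dots, \eta + \sum_{j=1}^{J}\sum_{i=1}^{N_j}\delta(z_{ji} = k, v_{ji} = V)\right).
\end{aligned}
\tag{24}
$$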



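When $K \to \infty$, the concentration parameter $\gamma_0$ can be sampled as

$$
\begin{aligned}
w_0 &\sim \mathrm{Beta}\left(\gamma_0 + 1, \sum_{j=1}^{J}\sum_{k=1}^{K} l_{jk}\right), \quad
\frac{\pi_0}{1-\pi_0} = \frac{e_0 + K^+ - 1}{(f_0 - \ln w_0)\sum_{j=1}^{J}\sum_{k=1}^{K} l_{jk}} \\
\gamma_0 &\sim \pi_0\,\mathrm{Gamma}\left(e_0 + K^+, \frac{1}{f_0 - \ln w_0}\right) + (1-\pi_0)\,\mathrm{Gamma}\left(e_0 + K^+ - 1, \frac{1}{f_0 - \ln w_0}\right)
\end{aligned}
\tag{25}
$$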



where $K^+$ is the number of used atoms. Since it is infeasible in practice to let $K \to \infty$, directly
using this method to sample $\gamma_0$ is only approximately correct, which may result in a biased estimate,
especially if $K$ is not set large enough. Thus, in the experiments, we do not sample $\gamma_0$ and fix
it at one. Note that for implementation convenience, it is also common to fix the concentration
parameter $\alpha$ at one [25]. We find through experiments that learning this parameter usually results in
clearly lower per-word perplexity for held-out words, so we allow $\alpha$ to be learned using the
data augmentation method proposed in [7], which is modified from the one proposed in [24].
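The auxiliary-variable update for $\alpha$ in (24) is short enough to sketch directly. This is an illustrative implementation under the assumption that the gamma prior is $\mathrm{Gamma}(a_0, 1/b_0)$ in shape-scale form and that every document is non-empty; the function name `sample_alpha` and its argument layout are hypothetical.

```python
import math
import random


def sample_alpha(alpha, doc_lengths, table_total, a0, b0, rng):
    """One update of the concentration parameter alpha, following (24).

    doc_lengths[j] is N_j (assumed positive); table_total is sum_{j,k} l_jk.
    """
    log_w_sum = 0.0
    s_sum = 0
    for n_j in doc_lengths:
        w_j = rng.betavariate(alpha + 1.0, n_j)              # w_j ~ Beta(alpha + 1, N_j)
        s_j = 1 if rng.random() < n_j / (n_j + alpha) else 0  # s_j ~ Bernoulli(N_j/(N_j+alpha))
        log_w_sum += math.log(w_j)
        s_sum += s_j
    shape = a0 + table_total - s_sum
    scale = 1.0 / (b0 - log_w_sum)
    return rng.gammavariate(shape, scale)


rng = random.Random(1)
alpha = 1.0
for _ in range(5):
    alpha = sample_alpha(alpha, [10, 20, 5], table_total=6, a0=1.0, b0=1.0, rng=rng)
print(alpha)
```

The shape stays positive because each non-empty document contributes at least one table while contributing at most one to $\sum_j s_j$.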



D NB-LDA 



The NB-LDA model is constructed as 



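$$
\begin{aligned}
x_{ji} &\sim F(\omega_{z_{ji}}), \quad \omega_k \sim \mathrm{Dir}(\eta, \dots, \eta) \\
N_j &= \sum_{k=1}^{K} n_{jk}, \quad n_{jk} \sim \mathrm{Pois}(\lambda_{jk}), \quad \lambda_{jk} \sim \mathrm{Gamma}(r_j, p_j/(1-p_j)) \\
r_j &\sim \mathrm{Gamma}(\gamma_0, 1/c), \quad p_j \sim \mathrm{Beta}(a_0, b_0), \quad \gamma_0 \sim \mathrm{Gamma}(e_0, 1/f_0)
\end{aligned}
\tag{26}
$$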



Note that letting $r_j \sim \mathrm{Gamma}(\gamma_0, 1/c)$, $\gamma_0 \sim \mathrm{Gamma}(e_0, 1/f_0)$ allows different documents to
share statistical strength for inferring their NB dispersion parameters.
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The Gibbs sampling inference can be expressed as 



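$$
\begin{aligned}
\Pr(z_{ji} = k|-) &\propto F(x_{ji}; \omega_k)\lambda_{jk} \\
&\;\;\vdots \\
(\omega_k|-) &\sim \mathrm{Dir}\left(\eta + \sum_{j=1}^{J}\sum_{i=1}^{N_j}\delta(z_{ji} = k, v_{ji} = 1), \dots, \eta + \sum_{j=1}^{J}\sum_{i=1}^{N_j}\delta(z_{ji} = k, v_{ji} = V)\right).
\end{aligned}
\tag{27}
$$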
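As a hedged sketch, the conditionals for the document-level parameters of the model in (26) can be derived with the same CRT and gamma-Poisson augmentation that underlies the Marked-Gamma-NB sampler in (35), with the roles of $j$ and $k$ exchanged. The expressions below are derived here from the priors in (26) under the shape-scale gamma parameterization; they are an assumption of this note rather than a quotation of the authors' updates, and $p'_j$ and $l'_j$ are auxiliary symbols introduced by analogy with $p'_k$ and $l'_k$.

$$
\begin{aligned}
(p_j|-) &\sim \mathrm{Beta}\left(a_0 + N_j,\; b_0 + K r_j\right), \quad p'_j = \frac{-K\ln(1-p_j)}{c - K\ln(1-p_j)} \\
(l_{jk}|-) &\sim \mathrm{CRT}(n_{jk}, r_j), \quad (l'_j|-) \sim \mathrm{CRT}\left(\sum_{k=1}^{K} l_{jk}, \gamma_0\right) \\
(\gamma_0|-) &\sim \mathrm{Gamma}\left(e_0 + \sum_{j=1}^{J} l'_j,\; \frac{1}{f_0 - \sum_{j=1}^{J}\ln(1-p'_j)}\right) \\
(r_j|-) &\sim \mathrm{Gamma}\left(\gamma_0 + \sum_{k=1}^{K} l_{jk},\; \frac{1}{c - K\ln(1-p_j)}\right) \\
(\lambda_{jk}|-) &\sim \mathrm{Gamma}(r_j + n_{jk}, p_j)
\end{aligned}
$$

Each line follows from conjugacy: $n_{jk} \sim \mathrm{NB}(r_j, p_j)$ gives the beta update for $p_j$, and $\sum_k l_{jk} \sim \mathrm{Pois}(-K r_j \ln(1-p_j))$ gives the gamma update for $r_j$.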



E NB-HDP 



The NB-HDP model is a special case of the Gamma-NB process model with pj = 0.5. The hier- 
archical model and inference for the Gamma-NB process are shown in (16) and (18) of the main 
paper, respectively. 



F NB-FTM 



The NB-FTM model is a special case of the zero-inflated NB process with $p_j = 0.5$, which is
constructed as

$$
\begin{aligned}
x_{ji} &\sim F(\omega_{z_{ji}}), \quad \omega_k \sim \mathrm{Dir}(\eta, \dots, \eta) \\
N_j &= \sum_{k=1}^{K} n_{jk}, \quad n_{jk} \sim \mathrm{Pois}(\lambda_{jk}) \\
\lambda_{jk} &\sim \mathrm{Gamma}(r_k b_{jk}, 0.5/(1-0.5)) \\
r_k &\sim \mathrm{Gamma}(\gamma_0, 1/c), \quad \gamma_0 \sim \mathrm{Gamma}(e_0, 1/f_0) \\
b_{jk} &\sim \mathrm{Bernoulli}(\pi_k), \quad \pi_k \sim \mathrm{Beta}(c/K, c(1-1/K)).
\end{aligned}
\tag{28}
$$

The Gibbs sampling inference can be expressed as

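$$
\begin{aligned}
\Pr(z_{ji} = k|-) &\propto F(x_{ji}; \omega_k)\lambda_{jk} \\
(b_{jk}|-) &\sim \delta(n_{jk} = 0)\,\mathrm{Bernoulli}\left(\frac{\pi_k(1-0.5)^{r_k}}{\pi_k(1-0.5)^{r_k} + (1-\pi_k)}\right) + \delta(n_{jk} > 0) \\
(\pi_k|-) &\sim \mathrm{Beta}\left(c/K + \sum_{j=1}^{J} b_{jk},\; c(1-1/K) + J - \sum_{j=1}^{J} b_{jk}\right), \quad
p'_k = \frac{-\sum_{j} b_{jk}\ln(1-0.5)}{c - \sum_{j} b_{jk}\ln(1-0.5)} \\
(l_{jk}|-) &\sim \mathrm{CRT}(n_{jk}, r_k b_{jk}), \quad (l'_k|-) \sim \mathrm{CRT}\left(\sum_{j=1}^{J} l_{jk}, \gamma_0\right) \\
(\gamma_0|-) &\sim \mathrm{Gamma}\left(e_0 + \sum_{k=1}^{K} l'_k,\; \frac{1}{f_0 - \sum_{k=1}^{K}\ln(1-p'_k)}\right) \\
(r_k|-) &\sim \mathrm{Gamma}\left(\gamma_0 + \sum_{j=1}^{J} l_{jk},\; \frac{1}{c - \sum_{j=1}^{J} b_{jk}\ln(1-0.5)}\right) \\
(\lambda_{jk}|-) &\sim \mathrm{Gamma}(r_k b_{jk} + n_{jk}, 0.5) \\
(\omega_k|-) &\sim \mathrm{Dir}\left(\eta + \sum_{j=1}^{J}\sum_{i=1}^{N_j}\delta(z_{ji} = k, v_{ji} = 1), \dots, \eta + \sum_{j=1}^{J}\sum_{i=1}^{N_j}\delta(z_{ji} = k, v_{ji} = V)\right).
\end{aligned}
\tag{29}
$$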
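The update for the zero-inflation indicators $b_{jk}$ is the step that distinguishes this sampler from the others. A minimal sketch follows, assuming $p_j = 0.5$ as in the NB-FTM model; the function names are illustrative and not taken from the paper.

```python
import random


def activation_prob(pi_k, r_k, p=0.5):
    """Posterior probability that b_jk = 1 given that n_jk = 0.

    An active topic produces a zero count with probability (1 - p)**r_k under NB(r_k, p).
    """
    active_and_zero = pi_k * (1.0 - p) ** r_k
    return active_and_zero / (active_and_zero + (1.0 - pi_k))


def sample_b(n_jk, pi_k, r_k, rng):
    """Draw b_jk: a positive count forces b_jk = 1, otherwise draw a Bernoulli."""
    if n_jk > 0:
        return 1
    return 1 if rng.random() < activation_prob(pi_k, r_k) else 0


print(activation_prob(0.5, 1.0))
print(sample_b(0, 0.5, 1.0, random.Random(0)))
```

A larger $r_k$ makes a zero count less plausible for an active topic, so the posterior activation probability of an unused topic decreases as $r_k$ grows.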

G Beta-NB 

The beta-NB process model is constructed as 

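$$
\begin{aligned}
x_{ji} &\sim F(\omega_{z_{ji}}), \quad \omega_k \sim \mathrm{Dir}(\eta, \dots, \eta) \\
N_j &= \sum_{k=1}^{K} n_{jk}, \quad n_{jk} \sim \mathrm{Pois}(\lambda_{jk}), \quad \lambda_{jk} \sim \mathrm{Gamma}(r_j, p_k/(1-p_k)) \\
r_j &\sim \mathrm{Gamma}(e_0, 1/f_0), \quad p_k \sim \mathrm{Beta}(c/K, c(1-1/K))
\end{aligned}
\tag{30}
$$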
The Gibbs sampling inference can be expressed as 

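$$
\begin{aligned}
\Pr(z_{ji} = k|-) &\propto F(x_{ji}; \omega_k)\lambda_{jk} \\
(p_k|-) &\sim \mathrm{Beta}\left(c/K + \sum_{j=1}^{J} n_{jk},\; c(1-1/K) + \sum_{j=1}^{J} r_j\right), \quad l_{jk} \sim \mathrm{CRT}(n_{jk}, r_j) \\
(r_j|-) &\sim \mathrm{Gamma}\left(e_0 + \sum_{k=1}^{K} l_{jk},\; \frac{1}{f_0 - \sum_{k=1}^{K}\ln(1-p_k)}\right) \\
(\lambda_{jk}|-) &\sim \mathrm{Gamma}(r_j + n_{jk}, p_k) \\
(\omega_k|-) &\sim \mathrm{Dir}\left(\eta + \sum_{j=1}^{J}\sum_{i=1}^{N_j}\delta(z_{ji} = k, v_{ji} = 1), \dots, \eta + \sum_{j=1}^{J}\sum_{i=1}^{N_j}\delta(z_{ji} = k, v_{ji} = V)\right).
\end{aligned}
\tag{31}
$$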

H Marked-Beta-NB 

The Marked-Beta-NB process model is constructed as 

$$
\begin{aligned}
x_{ji} &\sim F(\omega_{z_{ji}}), \quad \omega_k \sim \mathrm{Dir}(\eta, \dots, \eta) \\
N_j &= \sum_{k=1}^{K} n_{jk}, \quad n_{jk} \sim \mathrm{Pois}(\lambda_{jk}), \quad \lambda_{jk} \sim \mathrm{Gamma}(r_k, p_k/(1-p_k)) \\
r_k &\sim \mathrm{Gamma}(e_0, 1/f_0), \quad p_k \sim \mathrm{Beta}(c/K, c(1-1/K))
\end{aligned}
\tag{32}
$$

The Gibbs sampling inference can be expressed as

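$$
\begin{aligned}
\Pr(z_{ji} = k|-) &\propto F(x_{ji}; \omega_k)\lambda_{jk} \\
(p_k|-) &\sim \mathrm{Beta}\left(c/K + \sum_{j=1}^{J} n_{jk},\; c(1-1/K) + J r_k\right), \quad l_{jk} \sim \mathrm{CRT}(n_{jk}, r_k) \\
(r_k|-) &\sim \mathrm{Gamma}\left(e_0 + \sum_{j=1}^{J} l_{jk},\; \frac{1}{f_0 - J\ln(1-p_k)}\right) \\
(\lambda_{jk}|-) &\sim \mathrm{Gamma}(r_k + n_{jk}, p_k) \\
(\omega_k|-) &\sim \mathrm{Dir}\left(\eta + \sum_{j=1}^{J}\sum_{i=1}^{N_j}\delta(z_{ji} = k, v_{ji} = 1), \dots, \eta + \sum_{j=1}^{J}\sum_{i=1}^{N_j}\delta(z_{ji} = k, v_{ji} = V)\right).
\end{aligned}
\tag{33}
$$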
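The topic-level updates for $p_k$, $l_{jk}$, $r_k$ and $\lambda_{jk}$ in the Marked-Beta-NB sampler can be sketched as a single function over a count matrix. This is an illustrative implementation that assumes shape-scale gamma draws, $K > 1$, and a CRT sampler built from Bernoulli sums as in Lemma A.1; the names `sample_crt` and `update_topic_params` and the tiny clamp on $p_k$ are assumptions of this example.

```python
import math
import random


def sample_crt(m, r, rng):
    """Draw l ~ CRT(m, r) as a sum of independent Bernoulli(r / (n - 1 + r)), n = 1..m."""
    return sum(1 for n in range(1, m + 1) if rng.random() < r / (n - 1 + r))


def update_topic_params(n, r, c, e0, f0, rng):
    """One pass over the p_k, r_k and lambda_jk updates; n[j][k] holds the counts n_jk."""
    J, K = len(n), len(n[0])
    new_p, new_r = [], []
    lam = [[0.0] * K for _ in range(J)]
    for k in range(K):
        n_k = sum(n[j][k] for j in range(J))
        p_k = rng.betavariate(c / K + n_k, c * (1.0 - 1.0 / K) + J * r[k])
        p_k = min(p_k, 1.0 - 1e-12)  # numerical guard so that log(1 - p_k) stays finite
        l_k = sum(sample_crt(n[j][k], r[k], rng) for j in range(J))
        r_k = rng.gammavariate(e0 + l_k, 1.0 / (f0 - J * math.log(1.0 - p_k)))
        for j in range(J):
            lam[j][k] = rng.gammavariate(r_k + n[j][k], p_k)
        new_p.append(p_k)
        new_r.append(r_k)
    return new_p, new_r, lam


counts = [[5, 0, 1], [3, 0, 0], [8, 1, 0]]  # hypothetical n_jk for J = 3, K = 3
p, r, lam = update_topic_params(counts, [1.0, 1.0, 1.0], c=1.0, e0=1.0, f0=1.0,
                                rng=random.Random(0))
print(p, r)
```

Because every conditional is a standard beta, gamma, or Bernoulli-sum draw, no Metropolis-Hastings step is needed for the dispersion parameter.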

I Marked-Gamma-NB 

The Marked-Gamma-NB process model is constructed as 

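$$
\begin{aligned}
x_{ji} &\sim F(\omega_{z_{ji}}), \quad \omega_k \sim \mathrm{Dir}(\eta, \dots, \eta) \\
N_j &= \sum_{k=1}^{K} n_{jk}, \quad n_{jk} \sim \mathrm{Pois}(\lambda_{jk}), \quad \lambda_{jk} \sim \mathrm{Gamma}(r_k, p_k/(1-p_k)) \\
r_k &\sim \mathrm{Gamma}(\gamma_0/K, 1/c), \quad p_k \sim \mathrm{Beta}(a_0, b_0), \quad \gamma_0 \sim \mathrm{Gamma}(e_0, 1/f_0).
\end{aligned}
\tag{34}
$$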
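With $p'_k = \frac{-J\ln(1-p_k)}{c - J\ln(1-p_k)}$, the Gibbs sampling inference can be expressed as

$$
\begin{aligned}
\Pr(z_{ji} = k|-) &\propto F(x_{ji}; \omega_k)\lambda_{jk} \\
(p_k|-) &\sim \mathrm{Beta}\left(a_0 + \sum_{j=1}^{J} n_{jk},\; b_0 + J r_k\right), \quad p'_k = \frac{-J\ln(1-p_k)}{c - J\ln(1-p_k)} \\
l_{jk} &\sim \mathrm{CRT}(n_{jk}, r_k), \quad l'_k \sim \mathrm{CRT}\left(\sum_{j=1}^{J} l_{jk}, \gamma_0/K\right), \quad
\gamma_0 \sim \mathrm{Gamma}\left(e_0 + \sum_{k=1}^{K} l'_k,\; \frac{1}{f_0 - \sum_{k=1}^{K}\ln(1-p'_k)/K}\right) \\
(r_k|-) &\sim \mathrm{Gamma}\left(\gamma_0/K + \sum_{j=1}^{J} l_{jk},\; \frac{1}{c - J\ln(1-p_k)}\right), \quad (\lambda_{jk}|-) \sim \mathrm{Gamma}(r_k + n_{jk}, p_k) \\
(\omega_k|-) &\sim \mathrm{Dir}\left(\eta + \sum_{j=1}^{J}\sum_{i=1}^{N_j}\delta(z_{ji} = k, v_{ji} = 1), \dots, \eta + \sum_{j=1}^{J}\sum_{i=1}^{N_j}\delta(z_{ji} = k, v_{ji} = V)\right).
\end{aligned}
\tag{35}
$$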
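The update for $\gamma_0$ in the Marked-Gamma-NB sampler chains two augmentations: the auxiliary probability $p'_k$ and a second-level CRT draw $l'_k$. The sketch below illustrates that step only, assuming shape-scale gamma draws and a Bernoulli-sum CRT sampler; the function names and argument layout are hypothetical.

```python
import math
import random


def sample_crt(m, r, rng):
    """Draw l ~ CRT(m, r) as a sum of independent Bernoulli(r / (n - 1 + r)), n = 1..m."""
    return sum(1 for n in range(1, m + 1) if rng.random() < r / (n - 1 + r))


def sample_gamma0(l_sum, p, gamma0, J, c, e0, f0, rng):
    """One update of gamma_0; l_sum[k] = sum_j l_jk and p[k] = p_k."""
    K = len(l_sum)
    l_prime_total = 0
    log_term = 0.0
    for k in range(K):
        a = -J * math.log(1.0 - p[k])   # positive, since 0 < p_k < 1
        p_prime = a / (c + a)           # p'_k = -J ln(1-p_k) / (c - J ln(1-p_k))
        l_prime_total += sample_crt(l_sum[k], gamma0 / K, rng)
        log_term += math.log(1.0 - p_prime)
    return rng.gammavariate(e0 + l_prime_total, 1.0 / (f0 - log_term / K))


g0 = sample_gamma0([4, 0, 2], [0.6, 0.1, 0.3], gamma0=1.0, J=3, c=1.0,
                   e0=1.0, f0=1.0, rng=random.Random(0))
print(g0)
```

Writing $p'_k$ as `a / (c + a)` keeps the computation in terms of a positive quantity and avoids subtracting nearly equal numbers when $p_k$ is small.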
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